NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

If At First You Don’t Succeed, Try, Try, Again...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems

https://doi.org/10.1145/3694715.3695971

Stoica, Bogdan Alexandru; Sethi, Utsav; Su, Yiming; Zhou, Cyrus; Lu, Shan; Mace, Jonathan; Musuvathi, Madanlal; Nath, Suman (November 2024, ACM)

Full Text Available
WAFFLE: Exposing Memory Ordering Bugs Efficiently with Active Delay Injection

https://doi.org/10.1145/3552326.3567507

Stoica, Bogdan Alexandru; Lu, Shan; Musuvathi, Madanlal; Nath, Suman (May 2023, EuroSys)

Full Text Available
Cancellation in Systems: An Empirical Study of Task Cancellation Patterns and Failures

Sethi, Utsav; Pan, Haochen; Lu, Shan; Musuvathi, Madanlal; Nath, Suman (July 2022, Proceedings of the 16th USENIX Symposium on Operating Systems)

Full Text Available
Breaking the computation and communication abstraction barrier in distributed machine learning workloads

https://doi.org/10.1145/3503222.3507778

Jangda, Abhinav; Huang, Jun; Liu, Guodong; Sabet, Amir Hossein; Maleki, Saeed; Miao, Youshan; Musuvathi, Madanlal; Mytkowicz, Todd; Saarikivi, Olli (February 2022, ASPLOS 2022: Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems)

Recent trends towards large machine learning models require both training and inference tasks to be distributed. Considering the huge cost of training these models, it is imperative to unlock optimizations in computation and communication to obtain best performance. However, the current logical separation between computation and communication kernels in machine learning frameworks misses optimization opportunities across this barrier. Breaking this abstraction can provide many optimizations to improve the performance of distributed workloads. However, manually applying these optimizations requires modifying the underlying computation and communication libraries for each scenario, which is both time consuming and error-prone. Therefore, we present CoCoNet, which contains (i) a domain specific language to express a distributed machine learning program in the form of computation and communication operations, (ii) a set of semantics preserving transformations to optimize the program, and (iii) a compiler to generate jointly optimized communication and computation GPU kernels. Providing both computation and communication as first class constructs allows users to work on a high-level abstraction and apply powerful optimizations, such as fusion or overlapping of communication and computation. CoCoNet enabled us to optimize data-, model- and pipeline-parallel workloads in large language models with only a few lines of code. Our experiments show that CoCoNet significantly outperforms state-of-the-art distributed machine learning implementations.
more » « less
Full Text Available
BigTest: A Symbolic Execution Based Systematic Test Generation Tool for Apache Spark

https://doi.org/10.1145/3377812.3382145

Gulzar, Muhammad Ali; Musuvathi, Madanlal; Kim, Miryung (June 2020, Proceedings of 42nd IEEE/ACM International Conference on Software Engineering)

Full Text Available
White-box testing of big data analytics with complex user-defined functions

https://doi.org/10.1145/3338906.3338953

Gulzar, Muhammad Ali; Mardani, Shaghayegh; Musuvathi, Madanlal; Kim, Miryung (January 2019, ESEC/FSE 2019: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering)

Data-intensive scalable computing (DISC) systems such as Google’s MapReduce, Apache Hadoop, and Apache Spark are being leveraged to process massive quantities of data in the cloud. Modern DISC applications pose new challenges in exhaustive, automatic testing because they consist of dataflow operators, and complex user-defined functions (UDF) are prevalent unlike SQL queries. We design a new white-box testing approach, called BigTest to reason about the internal semantics of UDFs in tandem with the equivalence classes created by each dataflow and relational operator. Our evaluation shows that, despite ultra-large scale input data size, real world DISC applications are often significantly skewed and inadequate in terms of test coverage, leaving 34% of Joint Dataflow and UDF (JDU) paths untested. BigTest shows the potential to minimize data size for local testing by 10^5 to 10^8 orders of magnitude while revealing 2X more manually-injected faults than the previous approach. Our experiment shows that only few of the data records (order of tens) are actually required to achieve the same JDU coverage as the entire production data. The reduction in test data also provides CPU time saving of 194X on average, demonstrating that interactive and fast local testing is feasible for big data analytics, obviating the need to test applications on huge production data.
more » « less
Full Text Available

Search for: All records